Yelp Restaurant Rating¶

  • Data: yelp_academic_dataset_business.json from Yelp Dataset or Kaggle
  • Goal 1: predict restaurant ratings using attributes from their Yelp profiles
  • Purpose: businesses can identify the key factors that drive higher ratings and boost their popularity
  • Goal 2: build recommendation system for restaurants
In [1]:
%matplotlib inline

import math
import os
import random
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
# import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import sparse
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import confusion_matrix, mean_absolute_error as MAE
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, euclidean_distances
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt


# Suppress warnings
warnings.filterwarnings('ignore')

# Configure plot aesthetics
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 300
plt.rcParams['font.size'] = 8
plt.rcParams['figure.titlesize'] = 12
plt.rcParams['axes.linewidth'] = 0.1
plt.rcParams['patch.linewidth'] = 0

Load and Process Data¶

In [2]:
# Load data
path = 'data/yelpRestaurant.parquet'
if not os.path.isfile(path):
    !python yelpRestaurant.py

# Drop unnecessary columns
cols = ['name', 'city', 'state', 'postal_code',
        'latitude', 'longitude', 'categories']
df = pd.read_parquet(path).drop(columns=cols)

# Summary of missing values
display(pd.DataFrame({'missing_prop': df.isnull().sum() / len(df),
                      'values': [df[col].fillna(value='missing').unique() 
                                 for col in df.columns]
                     }).sort_values('missing_prop', ascending=False))
missing_prop values
dessert 0.536736 [missing, False, True]
latenight 0.520321 [missing, False, True]
brunch 0.517498 [missing, False, True]
breakfast 0.495476 [missing, False, True]
lunch 0.494238 [missing, False, True]
dinner 0.474072 [missing, False, True]
Caters 0.332444 [True, False, missing]
NoiseLevel 0.326991 [missing, average, quiet, loud, very_loud]
BikeParking 0.314366 [True, False, missing]
divey 0.305259 [missing, False, True]
trendy 0.302668 [missing, False, True]
hipster 0.285093 [missing, False, True]
intimate 0.282985 [missing, False, True]
classy 0.271810 [missing, False, True]
touristy 0.271771 [missing, False, True]
WiFi 0.270824 [free, no, missing, paid]
romantic 0.266435 [missing, False, True]
upscale 0.258662 [missing, False, True]
RestaurantsAttire 0.246810 [missing, casual, formal, dressy]
casual 0.244296 [missing, False, True]
Alcohol 0.227649 [none, full_bar, beer_and_wine, missing]
GoodForKids 0.208140 [missing, True, False]
RestaurantsGoodForGroups 0.199169 [missing, True, False]
OutdoorSeating 0.190197 [False, True, missing]
HasTV 0.189888 [missing, True, False]
street 0.180375 [True, missing, False]
RestaurantsReservations 0.171365 [missing, False, True]
lot 0.165681 [False, missing, True]
validated 0.163766 [False, missing, True]
garage 0.160808 [False, missing, True]
RestaurantsPriceRange2 0.139656 [1, missing, 2, 3, 4]
valet 0.132135 [False, missing, True]
RestaurantsDelivery 0.128732 [False, True, missing]
BusinessAcceptsCreditCards 0.119354 [False, True, missing]
RestaurantsTakeOut 0.079041 [True, missing, False]
review_count 0.000000 [80, 6, 19, 10, 28, 100, 245, 205, 40, 20, 339...
stars 0.000000 [4.0, 2.0, 3.0, 1.5, 2.5, 4.5, 3.5, 5.0, 1.0]

Exploratory Data Analysis¶

In [3]:
# Count occurrences of each combination of 'stars' and 'RestaurantsPriceRange2'
counts = df.groupby(['stars', 'RestaurantsPriceRange2']).size().unstack(fill_value=0)

# Create a stacked bar plot
counts.plot(kind='bar', stacked=True, colormap='tab10', edgecolor='black')

# Configure plot aesthetics
plt.title('Restaurant Star Ratings and Price Range')
plt.xlabel('Stars')
plt.ylabel('Number of Restaurants')
plt.xticks(rotation=0)
plt.legend(title='Price Range', bbox_to_anchor=(1.05, 1), loc='upper left')

# Show plot
plt.tight_layout()
plt.show()
No description has been provided for this image
In [4]:
sns.countplot(data=df, x='RestaurantsPriceRange2', hue=df['RestaurantsPriceRange2'])
plt.title('Price Range Distribution')
plt.show()
No description has been provided for this image
In [5]:
sns.kdeplot(data=df, x='review_count')
plt.title('Number of Reviews Distribution')
plt.show()
No description has been provided for this image

Modeling¶

Split Data¶

In [6]:
# Treat missing values as a category
df = df.fillna('missing')

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('stars', axis=1), df.stars, test_size=0.33, random_state=100)

Linear Regression¶

In [7]:
# Linear Regression
formula = 'stars ~ review_count'
for col in X_train.columns[2:]:
    formula += f" + C({col}, Treatment(reference='missing'))"

# Train linear regression model
ols = smf.ols(formula=formula, data=df.loc[X_train.index]).fit()
In [8]:
%%time
# One-hot encode training and test sets
X_train_OHE = pd.get_dummies(X_train)
X_test_OHE = pd.get_dummies(X_test)

# CART: Decision Tree Regressor with Cross-Validation
grid_values = {'ccp_alpha': np.linspace(0, 0.05, 50),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'max_depth': [20]}

dtr_cv = GridSearchCV(DecisionTreeRegressor(random_state=100), param_grid=grid_values,
                      scoring='r2', cv=5, verbose=0).fit(X_train_OHE, y_train)

# Print best parameters and score
print(f'Best r2 score ({dtr_cv.best_score_:.4f}) at ccp_alpha ({dtr_cv.best_params_['ccp_alpha']:.4f})')

# Plot ccp alpha vs CV R2
plt.plot(dtr_cv.cv_results_['param_ccp_alpha'].data, dtr_cv.cv_results_['mean_test_score'])
plt.xlabel('ccp alpha')
plt.ylabel('CV R2')
plt.grid(True, which='both')
plt.show()
Best r2 score (0.2310) at ccp_alpha (0.0010)
No description has been provided for this image
CPU times: user 50.9 s, sys: 718 ms, total: 51.6 s
Wall time: 51.7 s

Performance¶

Both regression models performed poorly, as indicated by low $OSR^2$ scores and high MAE values. While linear regression slightly outperformed decision tree regression, the difference in performance was minimal, showing that neither model effectively captured the relationships in the data.

In [9]:
# Performance Evaluation
def OSR2(y_test, y_train, y_pred):
    SSE = np.sum((y_test - y_pred)**2)
    SST = np.sum((y_test - np.mean(y_train))**2)
    return (1 - SSE/SST)


ols_pred = ols.predict(X_test)
dtr_pred = dtr_cv.best_estimator_.predict(X_test_OHE)

# Display OSR2 and MAE for both models
display(pd.DataFrame({
    'OSR2': [OSR2(y_test, y_train, ols_pred), OSR2(y_test, y_train, dtr_pred)],
    'MAE': [MAE(y_test, ols_pred), MAE(y_test, dtr_pred)]
}, index=['ols', 'dtr']))
OSR2 MAE
ols 0.220011 0.574134
dtr 0.216014 0.574054

Improvement¶

Since the regression models did not perform well, we will attempt to improve our approach by using classification, where the target variable is whether the restaurant's rating is 4 stars or above. This shift to classification is motivated by the idea that businesses are likely more concerned with maintaining high ratings rather than predicting the exact star value.

In [10]:
# Convert to classification task for stars >= 4
y_train1 = (y_train >= 4).astype(np.int8)
y_test1 = (y_test >= 4).astype(np.int8)
df['stars'] = (df['stars'] >= 4).astype(np.int8)
print(f'Proportion of positives: {np.mean(y_train1):.4f}')
Proportion of positives: 0.4461
In [11]:
# Regression Thresholding
ols_pred1 = (ols_pred >= 4).astype(np.int8)
dtr_pred1 = (dtr_pred >= 4).astype(np.int8)
print(f'ols:\n{confusion_matrix(y_test1, ols_pred1)}\n')
print(f'dtr:\n{confusion_matrix(y_test1, dtr_pred1)}')
ols:
[[9064  415]
 [6110 1479]]

dtr:
[[8723  756]
 [5476 2113]]
In [12]:
# Logistic Regression
log = smf.logit(formula=formula, data=df.loc[X_train.index]).fit()
Optimization terminated successfully.
         Current function value: 0.588771
         Iterations 6
In [13]:
%%time
# CART CV
grid_values = {'ccp_alpha': np.linspace(0, 0.05, 50),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'class_weight': [{0: 1, 1: 1}]}

# CART: Decision Tree Classifier with Cross-Validation
dtc_cv = GridSearchCV(DecisionTreeClassifier(random_state=100), param_grid=grid_values,
                      scoring='accuracy', cv=5, verbose=0).fit(X_train_OHE, y_train1)

# Print best parameters and score
print(f'Best r2 score ({dtc_cv.best_score_:.4f}) at ccp_alpha ({dtc_cv.best_params_['ccp_alpha']:.4f})')

# Plot ccp alpha vs CV R2
plt.plot(dtc_cv.cv_results_['param_ccp_alpha'].data, dtc_cv.cv_results_['mean_test_score'])
plt.xlabel('ccp alpha')
plt.ylabel('CV R2')
plt.grid(True, which='both')
plt.show()
Best r2 score (0.6658) at ccp_alpha (0.0010)
No description has been provided for this image
CPU times: user 54.3 s, sys: 1.1 s, total: 55.4 s
Wall time: 56.1 s

Performance Comparison¶

Based on the table, logistic regression and decision tree classifier performed better than the other models due to their higher accuracy and true positive rate (TPR). While the two models have a higher false positive rate (FPR), their TPRs are significantly better than those of the OLS and decision tree regression model, which show much lower TPRs (0.1949 and 0.2784, respectively). This indicates that logistic regression and decision tree classifier are more effective at identifying positive cases, even though they have a higher rate of false positives.

In [14]:
# Performance Comparison
preds = [np.repeat(y_train1.mode()[0], len(y_test)),
         ols_pred1, dtr_pred1,
         (log.predict(X_test) > 0.5).astype(np.int8),
         dtc_cv.best_estimator_.predict(X_test_OHE)
         ]

# Calculate Accuracy, TPR, FPR
accuracys, tprs, fprs = [], [], []
for pred in preds:
    tn, fp, fn, tp = confusion_matrix(y_test1, pred).ravel()
    accuracys.append((tp+tn) / len(y_test1))
    tprs.append(tp / (tp + fn))
    fprs.append(fp / (fp + tn))

# Display performance metrics
display(pd.DataFrame({
    'Accuracy': accuracys,
    'TPR': tprs,
    'FPR': fprs
}, index=['baseline', 'ols', 'dtr', 'log', 'dtc']))
Accuracy TPR FPR
baseline 0.555367 0.000000 0.000000
ols 0.617706 0.194887 0.043781
dtr 0.634872 0.278429 0.079755
log 0.685200 0.593886 0.241692
dtc 0.666276 0.624193 0.300032

Content Based Filtering¶

In [15]:
# Load data
df = pd.read_parquet(path).reset_index(drop=True)
states = df.state

# Combine restaurant info into a single string
df['restaurant'] = df[['name', 'city', 'state', 'postal_code']].agg(' '.join, axis=1)
df = df.drop(columns=['name', 'city', 'state', 'postal_code'])


# # Clean category data
# stop_words = set(stopwords.words('english'))
# def remove_stopwords(text):
#     """Remove stop words from a given text."""
#     word_tokens = word_tokenize(text)
#     filtered_text = [w for w in word_tokens if not w.lower() in stop_words]
#     return ' '.join(filtered_text)

# df['categories'] = (df['categories'].str.replace(r'[^a-zA-Z]', ' ', regex=True)
#                     # .str.replace(r'Food|Restaurant|Restaurants', '', regex=True)
#                     # .str.replace('Tex Mex', 'TexMex')
#                     .apply(remove_stopwords))

Location Based Clustering¶

For each restaurant, clustering is performed using their location coordinates (longitude and latitude). Based on the comparison, DBSCAN was selected as the clustering technique since it produced clearer and more distinct clusters. However, upon closer examination, some clusters included geographically distant U.S. states, which could be due to dataset inconsistencies. For now, these are disregarded, as some clusters may group neighboring states together effectively. Further validation would require additional data to assess the accuracy of these clusters.

In [16]:
# Perform clustering
df['kmeans'] = KMeans(random_state=100).fit(df[['latitude', 'longitude']]).labels_.astype(str)
df['dbscan'] = DBSCAN().fit(df[['latitude', 'longitude']]).labels_.astype(str)

# Plot clustering results
fig, axes = plt.subplots(1, 2, figsize=(12, 6), constrained_layout=True)
for ax, method, title in zip(axes, ['kmeans', 'dbscan'], ['KMeans - Not Good', 'DBSCAN - Good']):
    sns.scatterplot(data=df, x='longitude', y='latitude', hue=method, ax=ax)
    ax.set_title(title)
plt.show()
No description has been provided for this image
In [17]:
# Plot DBSCAN
clusters = df['dbscan'].unique()
n_cols = 3
n_rows = math.ceil(len(clusters) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows), constrained_layout=True)

# Plot each cluster
for ax, cluster in zip(axes.flatten(), clusters):
    data = df[df['dbscan'] == cluster]
    sns.scatterplot(ax=ax, data=data, x='longitude', y='latitude')
    ax.set_title(f'Cluster {cluster} - {states.loc[data.index].unique()}')

plt.show()
No description has been provided for this image

Recommend Restaurants¶

In [18]:
att = df.drop(columns=['latitude', 'longitude', 'categories', 'restaurant', 'kmeans', 'dbscan'])

cluster_cat_sim = dict()
In [19]:
def recommend_restaurants(place, k, w=None, df=df, clus='dbscan', att=att):
    """
    Recommend top k restaurants based on a restaurant's location and category similarity.

    Parameters:
    ----------
    place : str or int; name of the restaurant (str) or index (int)
    k : int; number of top recommendations
    w : list of float; weights for [distance, categories, attributes], must sum to 1
    df : DataFrame; restaurant data
    clus : str; clustering method (default 'dbscan')
    att : DataFrame; restaurant attributes

    Returns:
    -------
        DataFrame; recommended restaurants with names, star ratings, review counts, and categories, similarity score

    """
    # Get restaurant index
    if isinstance(place, str):
        idx = df[df.restaurant == place].index[0] if place in df.restaurant.values else print(f"'{place}' not in df")
    elif isinstance(place, int):
        idx = place if place in df.index else print(f"index {place} out of range")
    else:
        return

    if sum(w) not in [1]:
        return 'weights (distance, categories, attributes) must sum to 1 or 3'
    
    
    if clus not in ['dbscan', 'kmeans']:
        return 'clus must be dbscan or kmeans'
    cluster = df[clus].loc[idx]
    filtered = df[df[clus] == cluster]
    
    
    # Compute similarity based on restaurant distance
    sim0 = euclidean_distances(filtered[['longitude','latitude']], 
                               [filtered.loc[idx, ['longitude', 'latitude']]])
    sim0 = MinMaxScaler((0.001, 1)).fit_transform(sim0)
    # inverse distance to emphasize nearest restaurants
    filtered['sim0'] = MinMaxScaler().fit_transform(1 / sim0)
        
    # Get similarity based on restaurant category
    if cluster not in cluster_cat_sim:
        X = (TfidfVectorizer(tokenizer=lambda x: x.split(', '), dtype=np.float32)
             .fit_transform(filtered['categories']))
        cluster_cat_sim[cluster] = linear_kernel(X, X)
    filtered['sim1'] = cluster_cat_sim[cluster][filtered.index.get_loc(idx)]

    # Subset restaurants based on highest similarity scores
    att = pd.get_dummies(att).loc[filtered.index]
    att.iloc[:, 2:] = att.iloc[:, 2:].astype(bool).astype(np.int32)
    
    # Compute similarity based on restaurant attributes
    filtered['sim2'] = cosine_similarity(sparse.csr_matrix(att.loc[idx].values), 
                                         sparse.csr_matrix(att.values)).flatten()
    # Combine similarity scores
    filtered['sim'] = sum(w[i] * filtered[f'sim{i}'] for i in range(3))
    
    return (filtered.sort_values('sim', ascending=False).iloc[:k+1]
            [['restaurant', 'stars', 'review_count', 'categories', 'sim']])

# place = 'St Honore Pastries Philadelphia PA 19107'
# place = 'West Side Kebab House Edmonton AB T5T 1K8'
place = 18722
display(recommend_restaurants(place=place, k=10, w=[0.3, 0.4, 0.3]))
restaurant stars review_count categories sim
18722 Filipino Fusion Bar & Grill Tampa FL 33615 3.0 39 Filipino, Restaurants 1.000000
21899 Mercel's Bake Shop Pinellas Park FL 33781 3.0 44 Restaurants, Filipino 0.698878
19333 Pinoy St. Petersburg FL 33705 3.5 35 Filipino, Restaurants 0.698512
31062 Authentic Famous Filipino Restaurant Largo FL ... 4.0 45 Filipino, Restaurants 0.697945
4918 Manila Eats Riverview FL 33578 4.5 35 Restaurants, Filipino 0.697023
25342 Chismis & co Tampa FL 33602 4.5 20 Restaurants, Filipino 0.694355
16578 Mata’s Philippine Cuisine Tampa FL 33614 4.0 51 Restaurants, Fast Food, Filipino 0.666842
34928 Noah's Filipino Restaurant & Catering Services... 4.0 8 Filipino, Restaurants 0.653983
49521 Tindahang Pinoy Pinellas Park FL 33782 4.5 45 Filipino, Restaurants, Food, Grocery 0.611972
49038 Tropical Smoothie Cafe Tampa FL 33615 4.0 17 Juice Bars & Smoothies, Sandwiches, Restaurant... 0.601058
3858 Filipiniana Philippine Café Tampa FL 33614 4.0 108 Filipino, Restaurants, Bubble Tea, Food 0.595297

Evaluation¶

In [20]:
def simulation(df, k, iterations, weights):
    results = {str(np.round(w, 2)): {'relevant': [], 'avg_top_sim': [], 'avg_sim': [], 
                    'avg_star': [], 'unmatched_cat': []} for w in weights}

    for i, place in enumerate(random.sample(list(df.index), iterations)):
        print(i, end=' ')
        
        for system, w in zip(list(results.keys()), weights):
            recs = recommend_restaurants(place, k, w).iloc[1:]
            
            # Find unmatched categories word from queried restaurant
            unmatched = set()
            queried = set(recs.categories.iloc[0].split(', '))
            for row in recs.categories[1:]:
                unmatched.update(set(row.split()) - queried)
                
            results[system]['relevant'].append(len(recs[recs.sim > 0.6]))
            results[system]['avg_top_sim'].append(recs[recs.sim > 0.6].sim.mean())
            results[system]['avg_sim'].append(recs.sim.mean())
            results[system]['avg_star'].append(recs.stars.mean())
            results[system]['unmatched_cat'].append(len(unmatched))

    return results

# Set parameters
k, iterations = 10, 200
weights = [[1/3, 1/3, 1/3], [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5]] # [distance, category, attribute]

# Run simulation
simulation_results = simulation(df, k, iterations, weights)

# Convert to dataframe
results = []
for k, v in simulation_results.items():
    df = pd.DataFrame(v)
    df['weights'] = k
    results.append(df)
results = pd.concat(results, ignore_index=True)
display(results.head(2))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 
relevant avg_top_sim avg_sim avg_star unmatched_cat weights
0 10 0.648650 0.648650 2.75 13 [0.33 0.33 0.33]
1 10 0.662194 0.662194 2.35 7 [0.33 0.33 0.33]

Comparison Table¶

In [21]:
results.groupby('weights').mean()
Out[21]:
relevant avg_top_sim avg_sim avg_star unmatched_cat
weights
[0. 0.5 0.5] 10.000 0.905551 0.905551 3.52250 11.795
[0.33 0.33 0.33] 6.360 0.651228 0.617104 3.52200 13.740
[0.5 0. 0.5] 5.475 0.725010 0.641151 3.50625 33.480
[0.5 0.5 0. ] 0.180 0.729045 0.446654 3.49800 13.330

Comparison Plot¶

In [22]:
# Create a figure and axes for subplots
fig, axs = plt.subplots(2, 2, figsize=(18, 10), constrained_layout=True)

# Flatten the 2D array of axes for easy iteration
axs = axs.flatten()

# List of metrics and titles for subplots
metrics = [
    ('relevant', 'Number of Relevant Recommendations (Similarity > 0.6)'),
    ('avg_top_sim', 'Average Similarity Score (Similarity > 0.6)'),
    ('avg_sim', 'Average Similarity Score'),
    ('unmatched_cat', 'Number of Unmatched Categories Based on Original Restaurant'),
    # ('avg_star', 'Average Star Rating (not useful)')
]

# Loop through each metric and create the KDE plot
for ax, (metric, title) in zip(axs, metrics):
    sns.kdeplot(data=results, x=metric, hue='weights', ax=ax, fill=True)
    ax.set_title(title)

# Show the plot
plt.show()
No description has been provided for this image

Summary Findings¶

We evaluated various weight combinations on the performance of a content-based filtering recommendation system, using the top 10 results. Weight combinations are represented as $[DCA]$, where uppercase letters indicate non-zero weights for distance, category, and attribute, evenly distributed. Metrics include the number of relevant recommendations, average top similarity, average overall similarity, and unmatched categories.

Comparison:

  • $[DCA]$ performed reasonably well across all metrics, yielding a high number of relevant recommendations and low unmatched categories, though its average similarity score was not great
  • $[dCA]$ provided many relevant recommendations with high top and overall similarity scores, and low unmatched categories
  • $[DcA]$ had fewer relevant recommendations but a higher top similarity score relative to overall similarity, with a large number of unmatched categories
  • $[DCa]$ generated the fewest relevant recommendations, with a wide gap between top and overall similarity, but low unmatched categories.

Fewer relevant recommendations led to greater variability between top and overall similarity scores. Placing higher weights on category and attribute increased similarity, while emphasizing distance limited geographic options and produced inconsistent results. The distribution of unmatched categories remained consistent when category weights were non-zero

Further tuning of weights and incorporating user feedback are necessary for refining the system and improving recommendation quality.

In [ ]: